the maximum possible improvement in system performance of 
retrieval using the formal construction over simple term 

1. Introduction 

Many experiments have been performed to examine the 
application of term classifications to information retrieval. 
It has been shown by Salton [1], Sparck Jones and Jackson [2] 
and many others that a substantial improvement in system 
performance is obtained over simple term retrieval by using 
automatic thesaurus construction. The construction is 
generally based on the co-occurrence of terms in documents. 
It has been pointed out by Doyle [3] that such construction 
may have a number of difficulties. To avoid these difficulties, 
Jackson [4] has proposed a formal construction of term classes 
known as pseudo-classification in which the correlations of 
relevant documents which are not retrieved are raised and those 
of the irrelevant documents which are retrieved are lowered. 
Such construction is based on a given correlation function, 
a set of requests and the relevance judgement of the documents 
with respect to these requests and is outlined as follows. If 
a document is relevant to a request but the correlation between 
the document and the request is not sufficiently high for the 
document to be retrieved, then it is likely that some of the 
terms used in the request, though similar in meaning to those 
in the document, are distinct from them. If term classes are 
formed such that two distinct terms in the same class 
constitute a class 'match', then the terms in the request and 
those in the document which are related semantically may score 
enough class 'matches' so as to raise the correlation between 
the document and the request and thus enable the document to 
be retrieved. Similarly, if a document is irrelevant to a 
request but the correlation between the document and the 
request is high enough for the document to be retrieved, then 
terms in the query and those in the document are likely to 
score some class 'mismatches' to bring down the correlation 
so as to preclude the retrieval of the document. It is 
hoped that the construction will bring about the situation 
where every relevant document is retrieved and every 
irrelevant document is rejected. As pointed out by Jackson [4], 
whether a classification derived from a given set of queries 
and a given set of relevance judgements agrees with that derived 
from another set of queries and another set of relevance 
judgements has yet to be explored. 

In this paper, the computational complexity of the formal 
construction is examined. While the construction is shown to 
be 'difficult' computationally, heuristic methods are tried 
on a set of data to find out what improvement this classification 
model can possibly have over simple term retrieval. 
2. Problem Definition 

We shall assume all document and request vectors are 
binary. The formal construction of term classes will be 
formulated as follows. 

Definition 2.1. A document D is retrievable (not retrievable) 
by a request R if $f(D,R) > T$ ($f(D,R) \le T$), where T is a pre-set 
constant, and f is a matching function measuring the closeness 
between the document D and the request R. Normally, the value 
of f is determined by the number of terms in common and the 
number of terms not in common between D and R. 

However, if we want f to also measure the similarity in 
meaning of the terms in D and in R, f will have to satisfy 
some more properties. The exact properties which f should 
possess will be given later. 



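Definition 2.2. If $D = (a_1, a_2, \ldots, a_m)$ and $R = (b_1, b_2, \ldots, b_m)$ 
are two binary vectors, then 

$$D \cap R = (c_1, c_2, \ldots, c_m),$$
$$D - R = (d_1, d_2, \ldots, d_m),$$
$$R - D = (e_1, e_2, \ldots, e_m),$$
and 
$$D \cup R = (f_1, f_2, \ldots, f_m)$$

are defined as follows. 

$$c_i = \min\{a_i, b_i\}$$
$$d_i = a_i \mathbin{\dot{-}} b_i$$
$$e_i = b_i \mathbin{\dot{-}} a_i$$
$$f_i = \max\{a_i, b_i\}$$

where 

$$x \mathbin{\dot{-}} y = \begin{cases} x - y & \text{if } x \ge y \\ 0 & \text{otherwise.} \end{cases}$$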
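
The four vector operations of Definition 2.2 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, vectors are assumed to be equal-length tuples of 0s and 1s, and the sample D and R are the ones used later in Example 4.1.

```python
def intersect(d, r):
    """D n R: componentwise minimum of two binary vectors."""
    return tuple(min(a, b) for a, b in zip(d, r))

def minus(d, r):
    """D - R: 1 only where d has a 1 and r has a 0, otherwise 0."""
    return tuple(1 if a == 1 and b == 0 else 0 for a, b in zip(d, r))

def union(d, r):
    """D u R: componentwise maximum of two binary vectors."""
    return tuple(max(a, b) for a, b in zip(d, r))

# Document and request vectors taken from Example 4.1.
D = (0, 0, 0, 0, 1, 1, 1, 1)
R = (0, 1, 1, 0, 0, 1, 1, 0)

print(intersect(D, R))  # terms in common
print(minus(D, R))      # terms in D only
print(minus(R, D))      # terms in R only
print(union(D, R))      # terms in either
```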
We now introduce term "classes" with the intention of 
putting terms in D-R and those in R-D which are similar in 
meaning into the same term "classes". We assume certain 
subsets of the set of index terms to be called "classes". 
These may overlap, so that a term can belong to more than one 
class. With each term t we associate a vector 

$t = (t_1, t_2, \ldots, t_k)$ where 

$$t_i = \begin{cases} 1 & \text{if term } t \text{ is in the } i\text{th class} \\ 0 & \text{if term } t \text{ is not in the } i\text{th class} \\ d & \text{if don't care or don't know.} \end{cases}$$



The word "class" will remain undefined and we shall assume 
that it is synonymous to the word "interpretation". Since a 
term may have more than one interpretation, it may have more 
than one "1" in its k-ternary vector. This is consistent with 
our ordinary usage of words. 

Let us call this vector, the class vector of the index 
term t. 

Definition 2.3. Given two index terms A and B and their k-ternary 
class vectors $(a_1, a_2, \ldots, a_k)$ and $(b_1, b_2, \ldots, b_k)$, we 
define the class vector representing $A \cup B$ by $C = (c_1, c_2, \ldots, c_k)$ 
where 

$$c_i = \begin{cases} 1 & \text{if either } a_i = 1 \text{ or } b_i = 1 \\ 0 & \text{if neither } a_i = 1 \text{ nor } b_i = 1 \text{ and at least one of them} = 0 \\ d & \text{if } a_i = b_i = d. \end{cases}$$

Note that the above definition makes a '1' dominate a 
'0'. This is natural because a positive response is much 
more important than a negative one. 
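
As an illustration of Definition 2.3, the following sketch computes the class vector of a union of two terms. The function name is hypothetical, and the don't-care value is represented by the string 'd' (an assumption of this sketch); the two sample vectors are the class vectors of terms 5 and 8 in Example 4.1.

```python
def class_union(a, b):
    """Class vector of A u B for two k-ternary class vectors.

    Entries are 1, 0 or 'd'. A 1 dominates a 0, and a 0 dominates 'd'.
    """
    out = []
    for x, y in zip(a, b):
        if x == 1 or y == 1:
            out.append(1)
        elif x == 0 or y == 0:
            out.append(0)
        else:
            out.append('d')
    return tuple(out)

term5 = (0, 0, 'd', 'd', 1)
term8 = (1, 'd', 'd', 'd', 0)
print(class_union(term5, term8))
```

Because the operation is associative and commutative, it can be applied iteratively to any number of terms in any order.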

Definition 2.4. The number of term matches between a document 
D and a request R is the total number of 1's in $D \cap R$. 

The number of term mismatches between a document D and a 
request R is the total number of 1's in $(D-R) \cup (R-D)$. 

Definition 2.5. The class match vector, $C = (c_1, c_2, \ldots, c_k)$, 
of two class vectors $A = (a_1, a_2, \ldots, a_k)$ and $B = (b_1, b_2, \ldots, b_k)$ 
is defined by 

$$c_i = \begin{cases} 1 & \text{if } a_i = b_i \text{ and } a_i \ne d \\ 0 & \text{otherwise.} \end{cases}$$



Remark: If $a_i = b_i = 0$, i.e., a (0,0) condition, then there 
is a class match between A and B. However, if A and B are 
document vectors instead of class vectors, then there is no 
term match according to Definition 2.4. This is because in 
general there are too many (0,0) conditions in documents, 
while the number of class matches due to (0,0) conditions is 
a lot less. (See Definition 2.3 and Diagram 2.1 for the 
construction of class vectors for D-R and R-D.) If we take 
the approach that a (0,0) condition is a match but is less 
important than a (1,1) condition for both term and class 
matches, then a slight modification of the theory developed 
later will carry over. 



The class mismatch vector $D = (d_1, d_2, \ldots, d_k)$ of the two 
vectors is defined by 

$$d_i = \begin{cases} 1 & \text{if } a_i \ne b_i \text{ and neither } a_i = d \text{ nor } b_i = d \\ 0 & \text{otherwise.} \end{cases}$$

The situation in which ($a_i = 0$, $b_i = d$) or ($a_i = d$, $b_i = 0$) 
or ($a_i = d$, $b_i = 1$) or ($a_i = 1$, $b_i = d$) or ($a_i = d$, $b_i = d$) 
corresponds to a no match condition, i.e., a no match is neither 
a class match nor a mismatch. 

Definition 2.6. The number of class matches between two class 
vectors is given by the number of 1's in their class match 
vector. Similarly the number of class mismatches is given by 
the number of 1's in the class mismatch vector. 
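
Definitions 2.5 and 2.6 can be checked with a short sketch. The function names are hypothetical and the don't-care value is again represented by the string 'd'. The two sample vectors are the class vectors $(D-R)^c$ and $(R-D)^c$ that arise in Example 4.1.

```python
def class_match_vector(a, b):
    """1 where both entries agree and are not don't-care, else 0."""
    return tuple(1 if x == y and x != 'd' else 0 for x, y in zip(a, b))

def class_mismatch_vector(a, b):
    """1 where the entries differ and neither is don't-care, else 0."""
    return tuple(1 if x != y and x != 'd' and y != 'd' else 0
                 for x, y in zip(a, b))

A = (1, 0, 'd', 'd', 1)
B = (0, 0, 0, 1, 1)

matches = sum(class_match_vector(A, B))        # number of class matches
mismatches = sum(class_mismatch_vector(A, B))  # number of class mismatches
print(matches, mismatches)
```

Positions where at least one entry is 'd' contribute to neither count, which is the no match condition described above.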

Based on the above definitions, we require f, the matching 
function, to be non-increasing in the number of class and term 
mismatches and non-decreasing in the number of class and term 
matches. 

Remark: The reason why we impose the condition of non-decreasing 
instead of increasing in the number of class and term matches is 
that we want to include most natural matching functions in the 
class we defined. An example showing that the stricter condition 
is not appropriate is: 

Let the function $f = \frac{a}{a+b} + \frac{a'}{a'+b'}$, where a and b are 
the number of term matches and mismatches, respectively, 
and a' and b' are the number of class matches and mismatches. 
It is easily seen that when b' = 0, increasing 
a' will not increase the value of f. 

Similar remarks apply to the condition of non-increasing 
in the number of term and class mismatches. 

When we calculate the matching function value f(D,R), 
we first have to find the total number of term matches and 
mismatches. We have to delete the terms in common between 
D and R when calculating the number of class matches and 
mismatches because any term in $D \cap R$ would automatically 
contribute considerably to the number of class matches and 
the effect would dominate that due to other terms. 

A diagram illustrating how f(D,R) is calculated is shown 
in Diagram 2.1. 

Definition 2.7. An assessment matrix, Z, of m documents and n 
requests is a binary matrix of order m x n. The assessment 
matrix Z will correspond to the user's relevance judgement, 
i.e., 

$$Z(i,j) = \begin{cases} 1 & \text{if the user judges that } D_i \text{ is relevant to } R_j \\ 0 & \text{otherwise.} \end{cases}$$

A (D,R) pair, say $(D_i, R_j)$, satisfies assessment iff one 
of the following conditions is satisfied: 
i) When $Z(i,j) = 1$, $f(D_i, R_j) > T$ 
ii) When $Z(i,j) = 0$, $f(D_i, R_j) \le T$. 
Step 1  Obtain $D \cap R$, D - R and R - D. 

Step 2  From $D \cap R$, obtain the number of term matches. 

Step 3  From D - R and the term classification matrix, 
        obtain the class vector of D - R (using 
        Definition 2.3 iteratively), denoted by 
        $(D - R)^c$. 

Step 4  From R - D and the term classification matrix, 
        obtain the class vector of R - D, denoted by 
        $(R - D)^c$. 

Step 5  From (D - R) and (R - D), obtain the number of 
        term mismatches. 

Step 6  From $(D - R)^c$ and $(R - D)^c$, obtain the number 
        of class matches and mismatches. 

Step 7  From the number of class matches and mismatches 
        and the number of term matches and mismatches, 
        calculate f. 

Diagram 2.1. To Calculate f(D,R). 
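
Steps 1 to 6 of Diagram 2.1 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the term classification matrix is a list of rows (one per term) with entries 1, 0 or 'd', and the class vector of an empty set of terms is taken to be all don't-care. Step 7 is left to whichever matching function f is chosen. The data are those of Example 4.1.

```python
from functools import reduce

def class_union(a, b):
    """Class vector of a union of two terms (Definition 2.3)."""
    return tuple(1 if 1 in (x, y) else 0 if 0 in (x, y) else 'd'
                 for x, y in zip(a, b))

def match_counts(D, R, C):
    """Return (term matches, term mismatches, class matches, class mismatches)."""
    k = len(C[0])
    # Step 1: indices of terms in D - R and R - D.
    d_only = [i for i, (a, b) in enumerate(zip(D, R)) if a == 1 and b == 0]
    r_only = [i for i, (a, b) in enumerate(zip(D, R)) if a == 0 and b == 1]
    # Step 2: term matches are the 1's of D n R.
    term_matches = sum(1 for a, b in zip(D, R) if a == 1 and b == 1)
    # Steps 3 and 4: class vectors of D - R and R - D.
    all_d = ('d',) * k
    dc = reduce(class_union, (C[i] for i in d_only), all_d)
    rc = reduce(class_union, (C[i] for i in r_only), all_d)
    # Step 5: term mismatches are the 1's of (D - R) u (R - D).
    term_mismatches = len(d_only) + len(r_only)
    # Step 6: class matches and mismatches.
    class_matches = sum(1 for x, y in zip(dc, rc) if x == y and x != 'd')
    class_mismatches = sum(1 for x, y in zip(dc, rc)
                           if 'd' not in (x, y) and x != y)
    return term_matches, term_mismatches, class_matches, class_mismatches

d = 'd'
C = [(1, 1, d, 0, d), (d, 0, 0, 1, 0), (0, d, 0, d, 1), (d, d, d, 0, 1),
     (0, 0, d, d, 1), (1, 1, 0, 1, 0), (1, 0, 1, d, 0), (1, d, d, d, 0)]
D = (0, 0, 0, 0, 1, 1, 1, 1)
R = (0, 1, 1, 0, 0, 1, 1, 0)
print(match_counts(D, R, C))
```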
Thus, if a (D,R) pair satisfies assessment, then D is 
retrievable by R iff D is relevant to R. 
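
The test for whether a pair satisfies assessment reduces to comparing the relevance judgement with the retrievability condition. A minimal sketch, with hypothetical function names and with f(D,R) assumed to be already computed:

```python
def retrievable(f_value, T):
    """Definition 2.1: D is retrievable by R iff f(D,R) > T."""
    return f_value > T

def satisfies_assessment(z, f_value, T):
    """Definition 2.7: z is the entry Z(i,j), either 1 or 0."""
    return (z == 1) == retrievable(f_value, T)

# Values from Example 4.1: f(D,R) = 0.5778, T = 0.46, D not relevant to R.
print(satisfies_assessment(0, 0.5778, 0.46))
```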

The problem we are trying to solve is posed as follows. 

Given m documents, n requests, an assessment matrix Z, 
and a threshold T, find a k-tuple class vector for each index 
term such that as many (D,R) pairs as possible satisfy 
assessment as given by the matrix Z. This is the same as saying that 
as many of the relevant documents are retrieved and as many of 
the irrelevant documents are rejected as possible. 

Let us call this problem the satisfiable assessment 
problem. 

If there is a matrix C whose rows represent the index 
terms and whose columns represent the classes, then the above 
problem is the same as manipulating the entries of C so that 
as many entries of Z are satisfied as possible. Let the 
matrix C be called the term classification matrix. Usually, 
some of the terms are known to have certain semantic relations 
with some other terms, i.e., some of the entries of the matrix 
C are fixed. Thus, the satisfiable assessment problem reduces 
to fixing some entries of the matrix C, while varying other 
entries so that as many of the entries of Z are satisfied as possible. 
Let us call this problem the satisfiable assessment problem 
with a partial solution. We shall show in the next section 
that the satisfiable assessment problem with a partial 
solution is 'difficult' computationally. 
3. The Computational Complexity of the Problem 

[The results of this section were obtained jointly with 
S.K. Sahni.] 

Cook [5] introduced an equivalent class of 'difficult' 
problems known as the polynomial complete problems. So far, 
no one has obtained a polynomial algorithm (i.e., an algorithm 
with the number of operations bounded by a polynomial function 
of the length of the input) for any one of these problems. 
However, all of these problems can be solved by a non-deterministic 
Turing machine in polynomial time. Whether there is a 
deterministic polynomial algorithm for any of these problems 
depends on whether we can simulate an arbitrary non-deterministic 
Turing machine in polynomial time by a deterministic Turing 
machine running also in polynomial time. While we are unable 
to settle the above question, we will show that the satisfiable 
assessment problem with a partial solution is polynomial complete. 

All of the above mentioned 'difficult' problems are 
polynomial reducible in the sense that if $P_1$ and $P_2$ are any two 
of these problems and $P_1$ can be solved in time f(n), with input 
n, then $P_2$ can be solved in time f(p(n)) where p is a polynomial 
function. A polynomial complete problem is deciding whether a 
given formula in conjunctive normal form having at most 3 
literals per clause is satisfiable (to be explained later) (see [5]). 
From this, one can easily show that satisfiability with exactly 
3 literals per clause is polynomial complete, see [6]. Thus 
all we need to claim that our problem is polynomial complete is 
to show that there is a deterministic polynomial algorithm for 
our problem iff there is one for the exactly 3-literal 
satisfiability problem. 

An example showing a formula in conjunctive normal form 

is as follows. 

Example 3.1. (X^vX2vX4) a (X5vX2vX^vX3^^) a (X^vX3VXg) . 

This formula has three clauses, namely (X^vX2VX^) , tX5^^2^^9^^11) 

and (X^vx^vXg) . Each of the letters or its complement is called 

a literal. Thus X^, X2 ^11 ^® ^ literal. 

Definition 3.1. A formula is satisfiable iff under some 
assignment of 0-1 values to the variables, every clause of the 
formula has the value '1'. 

In the above example, the assignment x^ = X2 ■ » 1 

results in each clause having the value 1 and so the formula 

is satisfiable. 
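
To make Definition 3.1 concrete, the sketch below tests satisfiability by exhaustive enumeration of all 0-1 assignments. Everything in it is an assumption made for illustration: the function name, the encoding of a literal as a signed integer (+i for $x_i$, -i for its complement), and the sample formulas, which are hypothetical and not the formula of Example 3.1. The enumeration takes time exponential in the number of variables, which is the behaviour a polynomial algorithm would have to avoid.

```python
from itertools import product

def satisfiable(clauses, n_vars):
    """Return True iff some 0-1 assignment gives every clause the value 1.

    Each clause is a list of non-zero integers: +i stands for x_i and
    -i for the complement of x_i (variables are numbered from 1).
    """
    for bits in product((0, 1), repeat=n_vars):
        if all(any(bits[abs(lit) - 1] == (1 if lit > 0 else 0) for lit in clause)
               for clause in clauses):
            return True
    return False

# Hypothetical formula: (x1 v x2 v ~x3) ^ (~x1 v x3) ^ (x2)
print(satisfiable([[1, 2, -3], [-1, 3], [2]], 3))
# Hypothetical unsatisfiable formula: (x1) ^ (~x1)
print(satisfiable([[1], [-1]], 1))
```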

The main result of this section is: 

Theorem 3.1. The satisfiable assessment problem with a 

partial solution is polynomial complete. 

Proof: For a proof of this theorem see Appendix 1. 

Since any problem which can be solved by a random access 
machine in deterministic polynomial time can also be simulated 
by a Turing machine in deterministic polynomial time, the above 
problem is also polynomial complete when considering a random 
access machine. 
4. Heuristic Methods 

In the last section, we showed that it was very likely 
that our problem would require exponential time. In this 
section, we shall present some heuristic methods which run 
in relatively little time as compared to an exhaustive 
enumeration algorithm but do not guarantee that the number 
of (D,R) pairs satisfying assessment would be the maximum 
possible. 

Notation 4.1. Let $\{C_{ij}\}$, $j = 1, \ldots, n$, be the term classification 
matrix, where i denotes the term and j denotes the class. 
$\mathrm{Op}(C_{ij}, x \to y)$ means that $C_{ij}$ is changed from x to y. There are 
six such operations, namely $\mathrm{Op}(C_{ij}, 0 \to 1)$, $\mathrm{Op}(C_{ij}, 0 \to d)$, 
$\mathrm{Op}(C_{ij}, 1 \to d)$, $\mathrm{Op}(C_{ij}, 1 \to 0)$, $\mathrm{Op}(C_{ij}, d \to 1)$ and 
$\mathrm{Op}(C_{ij}, d \to 0)$. Let $x_i$ be the ith term, $y_j$ the jth class. 
Let $(D-R)^c$ and $(R-D)^c$ be the class 
vectors of (D-R) and (R-D), respectively. If A is a vector, 
let $A_i$ be its ith component. We shall consider every vector 
as a set in the following sense. If $A_i = 1$ and A is a document 
vector, then $x_i \in A$. If $A_i = d$ or 0, then $x_i \notin A$. Similarly 
if $A_j = 1$ and A is a class vector, then $y_j \in A$. 

Having defined the above six operations, we shall make 
use of them as follows. Suppose we are given a (D,R) pair. 
If the pair satisfies assessment, then there is no need to 
apply any operation. If D is judged relevant to R but 
$f(D,R) \le T$, then an operation or a sequence of operations is 
applied to the pair so as to increase the value of f(D,R). 
Similarly, if $f(D,R) > T$ but D is judged irrelevant to R, then 
another operation or sequence of operations is chosen to 
decrease f(D,R). Sometimes, a single operation may not be 
able to bring about any change on a given (D,R) pair. For 
example, (D-R) has two terms $t_1$ and $t_2$ with the class vectors 
of $t_1$ and $t_2$ having a '1' in the jth position. Thus $(D-R)^c_j = 1$. 
Suppose the desired change is to convert $(D-R)^c_j$ from 1 to 0. 
Then neither the operation $\mathrm{Op}(C_{t_1 j}, 1 \to 0)$ nor the operation 
$\mathrm{Op}(C_{t_2 j}, 1 \to 0)$ alone can make $(D-R)^c_j = 0$. However, 
$\mathrm{Op}(C_{t_1 j}, 1 \to 0)$ followed by $\mathrm{Op}(C_{t_2 j}, 1 \to 0)$, or the 
operations in the reverse order, convert 
$(D-R)^c_j$ from 1 to 0. A sequence of operations $\mathrm{Op}(C_{ij}, x \to y)$, with 
x, y and j fixed while varying i, applied to a (D,R) pair which 
brings about a desired change in exactly one of $(D-R)^c_j$ and $(R-D)^c_j$ 
is called a composite operation. 
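
The need for composite operations can be seen in a small sketch. The helper names and the two-term, one-class matrix are hypothetical; entries are 1, 0 or 'd', and indices are counted from 0 as usual in Python.

```python
def union_entry(x, y):
    """One component of the class vector union (Definition 2.3)."""
    if x == 1 or y == 1:
        return 1
    if x == 0 or y == 0:
        return 0
    return 'd'

def class_entry(C, terms, j):
    """The j-th component of the class vector of a set of terms."""
    value = 'd'
    for i in terms:
        value = union_entry(value, C[i][j])
    return value

def apply_op(C, i, j, x, y):
    """Op(C_ij, x -> y): return a copy of C with entry (i, j) changed."""
    if C[i][j] != x:
        raise ValueError("entry does not currently hold x")
    new_C = [list(row) for row in C]
    new_C[i][j] = y
    return new_C

# Two terms of D - R, both with a 1 in class 0.
C = [[1], [1]]
terms = [0, 1]

single = apply_op(C, 0, 0, 1, 0)          # one operation only
composite = apply_op(single, 1, 0, 1, 0)  # the same operation on both terms

print(class_entry(C, terms, 0), class_entry(single, terms, 0),
      class_entry(composite, terms, 0))
```

A single operation leaves the class entry at 1 because the other term still contributes a 1. Only the composite operation brings it to 0.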

We shall assume that the matching function f is monotonically 
increasing in the number of class matches and monotonically 
decreasing in the number of class mismatches. This restriction 
will be removed later on. One observation we can make is that 
whether f(D,R) increases, decreases or remains unchanged after 
an application of an operation or a composite operation is 
completely independent of the function, f, being chosen. 

This observation is stated as follows. 

Theorem 4.1. Let f and g be two matching functions. If $f_0$ 
and $g_0$ are the two matching function values of a given (D,R) 
pair and $f_0'$ and $g_0'$ are their values respectively after an 
application of an operation or a composite operation, then 

$$f_0 < f_0' \iff g_0 < g_0'.$$

Proof: It is sufficient to prove the theorem for the case of 
a composite operation. Let the composite operation be 
$\mathrm{Op}(C_{ij}, x \to y)$ for a number of i's. For any given (D,R) pair, 
the composite operation can at most change the value of $(D-R)^c_j$ 
and $(R-D)^c_j$, i.e., $(D-R)^c_l$ and $(R-D)^c_l$ remain unchanged if $l \ne j$. 
Considering the jth class between $(D-R)^c_j$ and $(R-D)^c_j$ before the 
operation is applied, there are three possibilities, namely, 
a class match, a no match, and a class mismatch. After the 
application of the operation, we also have the same three 
possibilities. Thus, an application of a composite operation 
produces one and only one of the following nine changes: 1) 
a class match is converted to a class match; 2) a class mismatch 
to a class mismatch; 3) a no match to a no match; 4) a mismatch 
to a class match; 5) a no match to a class match; 6) a mismatch 
to a no match; 7) a class match to a mismatch; 8) a class match 
to a no match; 9) a no match to a mismatch. For the first three 
cases, any matching function remains unchanged. For the next 
three cases, any matching function increases and for the last 
three cases, any matching function decreases. 
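
The nine cases of the proof can be tabulated mechanically. The sketch below is an illustration with hypothetical names: it orders the three conditions as mismatch < no match < match and reports the direction in which any matching function satisfying the monotonicity assumption above must move.

```python
def status(d, r):
    """Condition of one class position: 'match', 'mismatch' or 'no match'."""
    if d == 'd' or r == 'd':
        return 'no match'
    return 'match' if d == r else 'mismatch'

RANK = {'mismatch': -1, 'no match': 0, 'match': 1}

def direction(before, after):
    """+1 if f increases, -1 if f decreases, 0 if f is unchanged."""
    diff = RANK[after] - RANK[before]
    return (diff > 0) - (diff < 0)

for before in RANK:
    for after in RANK:
        print(before, '->', after, ':', direction(before, after))
```

Since the direction depends only on the two conditions and not on f itself, the same table applies to every such matching function.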

Based on this theorem, we shall design an algorithm 
which is completely independent of the matching function to 
be chosen. 
We now examine whether any arbitrary given (D,R) pair can 
possibly satisfy a given assessment. If $D \subseteq R$ or $R \subseteq D$, then 
either (D-R) or (R-D) is empty, which implies that either 
$(D-R)^c$ or $(R-D)^c$ contains don't care entries only. Suppose 
$(D-R)^c$ contains don't care entries only, then any $(R-D)^c$ will 
not give rise to a class match or mismatch. Thus no change in 
the term classification matrix can alter the value of f(D,R). 
Let us consider another situation. The highest value which 
f(D,R) can take on is to have all class matches and no class 
mismatch. If D is judged relevant to R and f(D,R) takes on 
its maximal value but $f(D,R) \le T$, then no change in the 
classification matrix can make (D,R) satisfy assessment. 
Similarly if D is judged irrelevant to R and the minimum value 
of f(D,R) is greater than T, then no change in the term 
classification matrix will make (D,R) satisfy assessment. 
Let the highest and lowest values of f(D,R) be $f_{high}$ and $f_{low}$ 
respectively. Based on the above discussion, we obtain the 
following proposition. 

Proposition 4.2. A given (D,R) pair cannot satisfy assessment 
under any term classification matrix iff one of the following 
conditions is satisfied: 

i) (D,R) does not satisfy assessment under some term 
   classification matrix and either $D \subseteq R$ or $R \subseteq D$. 
ii) D is relevant to R and $f_{high} \le T$. 
iii) D is not relevant to R and $f_{low} > T$. 
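
Proposition 4.2 translates directly into a test for pairs that can never satisfy assessment. In this sketch the function names and argument layout are hypothetical, and the values of f(D,R) under the current matrix, $f_{high}$ and $f_{low}$ are assumed to have been computed beforehand for whichever matching function is in use.

```python
def is_subset(a, b):
    """True iff every term of binary vector a is also in b."""
    return all(x <= y for x, y in zip(a, b))

def cannot_satisfy(D, R, relevant, f_current, f_high, f_low, T):
    """Return True iff one of conditions (i) to (iii) holds."""
    satisfies_now = (f_current > T) == relevant
    # (i) classes cannot change f when D - R or R - D is empty.
    if not satisfies_now and (is_subset(D, R) or is_subset(R, D)):
        return True
    # (ii) even the highest attainable value is not above the threshold.
    if relevant and f_high <= T:
        return True
    # (iii) even the lowest attainable value is above the threshold.
    if not relevant and f_low > T:
        return True
    return False

print(cannot_satisfy((1, 1, 0), (1, 1, 1), True, 0.3, 0.9, 0.1, 0.46))
```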
If condition (i) is satisfied, it is likely that the user 
gives a false assessment of the (D,R) pair. If either condition 
(ii) or condition (iii) is satisfied, then we should increase the 
number of classes so that $f_{high}$ can be increased and $f_{low}$ can be 
lowered. Thus, conditions (ii) and (iii) give us an indication 
of the minimum number of classes to be chosen. 

Definition 4.1. A (D,R) pair which cannot possibly satisfy 
assessment is called a discarded pair. 

The next thing we consider is how we should go about 
increasing or decreasing f(D,R) for a given (D,R) pair. Since 
decreasing f(D,R) is analogous to increasing it, we shall 
consider increasing f(D,R) only. Let $(D-R)^c$ and $(R-D)^c$ be 
$(d_1, d_2, \ldots, d_n)$ and $(r_1, r_2, \ldots, r_n)$ respectively. It is sufficient 
to examine changing the class vector of (D-R) only, since the 
other case, i.e., changing the class vector of (R-D), is very 
similar. The first class of $(D-R)^c$ and $(R-D)^c$ is scanned. If 
$d_1 = r_1 \ne d$, then no operation applied on the first class can 
increase f(D,R). If $d_1 = r_1 = d$, then it takes at least two 
operations to make the first class a class match. Since later 
operations applied to other (D,R) pairs may make either 
$d_1$ or $r_1$ a non-d entry, we will postpone making any change 
for $d_1$ and $r_1$. If $d_1 \ne r_1$ and we want to use an operation of 
the form $\mathrm{Op}(C_{i1}, x \to y)$ with $x_i \in D-R$, we have four cases, namely 
$d_1 = 0$, $r_1 = 1$; $d_1 = d$, $r_1 = 1$; $d_1 = d$, $r_1 = 0$; and $d_1 = 1$, $r_1 = 0$. (For the 
other cases, i.e. ($d_1 = 0$, $r_1 = d$) and ($d_1 = 1$, $r_1 = d$), changing $d_1$ 
will not increase f(D,R).) We now consider the case 
($d_1 = 1$, $r_1 = 0$) in detail. When $\mathrm{Op}(C_{i1}, 1 \to 0)$ is applied to all terms 
i such that $x_i \in D-R$ and $C_{i1} = 1$, $(D-R)^c_1$ will be set to 0, which will 
then increase the number of class matches by 1. Although this 
operation helps the current (D,R) pair to satisfy assessment, it 
may deteriorate other (D,R) pairs. Thus, we may not want to 
apply this operation until some other criteria are satisfied. 
Four strategies will be given later, which will decide whether 
a given operation should be applied or not. Suppose the 
operation $\mathrm{Op}(C_{i1}, 1 \to 0)$ is decided not applicable, then we should 
try operation $\mathrm{Op}(C_{i1}, 1 \to d)$. This operation may not improve the 
(D,R) pair as much as $\mathrm{Op}(C_{i1}, 1 \to 0)$ but on the other hand, it has 
more chance of being decided applicable. Suppose that $\mathrm{Op}(C_{i1}, 1 \to d)$ 
is decided applicable but the improvement is not sufficient to 
make (D,R) satisfy assessment, then we should try $\mathrm{Op}(C_{i1}, d \to 0)$ 
for some i such that $x_i \in D-R$ and $C_{i1} = d$. Note that the failure 
of $\mathrm{Op}(C_{i1}, 1 \to 0)$ for all i such that $x_i \in D-R$ and $C_{i1} = 1$ may not 
necessarily imply the failure of $\mathrm{Op}(C_{i1}, 1 \to d)$ for all i such 
that $x_i \in D-R$ and $C_{i1} = 1$ succeeded by $\mathrm{Op}(C_{i1}, d \to 0)$ for some i 
such that $x_i \in D-R$ and $C_{i1} = d$. Suppose that the operation is 
applied but the improvement is still not enough. Then the 
next class is considered. If all the classes have been 
considered but the (D,R) pair does not satisfy assessment, 
then we repeat the process for the class vector of (R-D). 
If the (D,R) pair still does not satisfy assessment, then the 
(D,R) pair is stacked and will be handled later on. 
Diagrams 4.1 and 4.2 give the flowcharts for increasing and 
decreasing f(D,R) for a given (D,R) pair. Before we proceed further, 
we will give a detailed example showing how a (D,R) pair is 
processed. 

Example 4.1. Let the matching function be 

f(D,R) = cos(D,R) + [...] × 0.1 

where a' and b' are the class matches and mismatches between 
$(D-R)^c$ and $(R-D)^c$, respectively. 

D = (0,0,0,0,1,1,1,1) 
R = (0,1,1,0,0,1,1,0). 

Let the assessment be D not relevant to R. So, we want 
$f(D,R) \le$ the threshold T = 0.46. 

The term classification matrix, C, before this (D,R) 
pair is processed is 

                  classes 
               1  2  3  4  5 
            1  1  1  d  0  d 
            2  d  0  0  1  0 
            3  0  d  0  d  1 
C =  terms  4  d  d  d  0  1 
            5  0  0  d  d  1 
            6  1  1  0  1  0 
            7  1  0  1  d  0 
            8  1  d  d  d  0 

(D-R) = (0,0,0,0,1,0,0,1) 
$(D-R)^c$ = (0,0,d,d,1) ∪ (1,d,d,d,0) 
          = (1,0,d,d,1) 
[Diagram 4.1. Flowchart: to increase f(D,R). (The other cases are similar to case 1.)] 
[Diagram 4.2. Flowchart: to decrease f(D,R). (Case 3 is done in detail.)] 
(R-D) = (0,1,1,0,0,0,0,0) 
$(R-D)^c$ = (d,0,0,1,0) ∪ (0,d,0,d,1) 
          = (0,0,0,1,1) 
From $(D-R)^c$ and $(R-D)^c$, we have a' = 2 and b' = 1. 

f(D,R) = cos(D,R) + [...] × 0.1 
       = 0.5 + 0.0778 = 0.5778 

Since f(D,R) > T, we want to decrease f(D,R). There is already 
a class mismatch in the first class. Thus we scan the second 
class. We would like to change the second class of $(D-R)^c$ to 
a 1 so that we can have a mismatch in the second class. Thus 
the class chosen is 2. The term chosen is the first non-zero 
term of (D-R) whose second class equals 0 or d, i.e. term 5. 
Thus the operation is $\mathrm{Op}(C_{52}, 0 \to 1)$ or $\mathrm{Op}(C_{52}, d \to 1)$. 

Suppose that the operation is not allowed for not 
satisfying certain criteria. Then the next term chosen is 
term 8 of (D-R). Thus the operation is $\mathrm{Op}(C_{82}, d \to 1)$, since 
the second class of term 8 is d. 

Suppose that the operation is again not allowed. Then 
we would like to change the second class of $(D-R)^c$ to a d so 
that we will not have a class match in the second class. The 
term we choose is the first non-zero term of (D-R) whose class 
equals 0. Thus the operation is $\mathrm{Op}(C_{52}, 0 \to d)$. 

Suppose the operation is allowed by the strategy. The 
operation is applied and the new classification matrix becomes 
                  classes 
               1  2  3  4  5 
            1  1  1  d  0  d 
            2  d  0  0  1  0 
            3  0  d  0  d  1 
     terms  4  d  d  d  0  1 
            5  0  d  d  d  1 
            6  1  1  0  1  0 
            7  1  0  1  d  0 
            8  1  d  d  d  0 

Now, a' = 1 and b' = 1. 

The new f(D,R) = 0.5600 and we still have to decrease 
f(D,R). 
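
The class match and mismatch counts quoted in this example can be reproduced with a short sketch. The helper names are hypothetical, the matrix is the one given before the pair is processed, and only the counts a' and b' are computed; the value of f is not recomputed here. Python indices start at 0, so the entry for term 5, class 2 is C[4][1].

```python
d = 'd'
C = [[1, 1, d, 0, d], [d, 0, 0, 1, 0], [0, d, 0, d, 1], [d, d, d, 0, 1],
     [0, 0, d, d, 1], [1, 1, 0, 1, 0], [1, 0, 1, d, 0], [1, d, d, d, 0]]
D = (0, 0, 0, 0, 1, 1, 1, 1)
R = (0, 1, 1, 0, 0, 1, 1, 0)

def class_vector(C, term_indices):
    """Union (Definition 2.3) of the class vectors of the given terms."""
    vec = ['d'] * len(C[0])
    for i in term_indices:
        for j, entry in enumerate(C[i]):
            if entry == 1 or vec[j] == 1:
                vec[j] = 1
            elif entry == 0 or vec[j] == 0:
                vec[j] = 0
    return vec

def class_counts(C, D, R):
    """Return (a', b'): class matches and mismatches of the (D,R) pair."""
    dc = class_vector(C, [i for i in range(len(D)) if D[i] == 1 and R[i] == 0])
    rc = class_vector(C, [i for i in range(len(D)) if R[i] == 1 and D[i] == 0])
    a = sum(1 for x, y in zip(dc, rc) if x == y and x != 'd')
    b = sum(1 for x, y in zip(dc, rc) if x != y and x != 'd' and y != 'd')
    return a, b

before = class_counts(C, D, R)

# Apply Op(C_52, 0 -> d) to a copy of the matrix.
C_new = [list(row) for row in C]
assert C_new[4][1] == 0
C_new[4][1] = d
after = class_counts(C_new, D, R)

print(before, after)
```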

The other operations attempted are listed below: 

$\mathrm{Op}(C_{53}, d \to 1)$, which is not allowed; 
$\mathrm{Op}(C_{83}, d \to 1)$, which is not allowed; 
$\mathrm{Op}(C_{54}, d \to 0)$, which is allowed. 



At this stage, f(D,R) = 0.5333 and we still have to decrease 
f(D,R). The next operation is 

OpCCgg, l-*0) which is not allowed. 

The next operation is Op{C^^, W) . Since this operation 
fails, there is no need to try OpCCg^, 0 ^d) . After this 
operation, we interchange (D-R) and (R-D) and also (D-R)^ 
and (R-D) and other operations are tried. 

Now, we present the four strategies which decide whether 
a given operation is applicable or not. 

Definition 4.2. A (D,R) pair responds favorably to an 
operation or a composite operation if one of the following 
conditions is satisfied: 

i) if D is relevant to R, then one of the following 
   must be satisfied 

   a) the number of class mismatches is decreased 
      and there is no decrease in the number of 
      class matches 

   b) the number of class matches is increased and 
      there is no increase in the number of class 
      mismatches, 

ii) if D is not relevant to R, then one of the following 
    must be satisfied 

   a) the number of class matches is decreased and 
      there is no decrease in the number of class 
      mismatches 

   b) the number of class mismatches increases and 
      there is no increase in the number of class 
      matches, 

iii) there is no change in the number of class matches 
     or mismatches. 

A (D,R) pair responds strictly favorably to an operation 
or a composite operation if either (i) or (ii) is satisfied. 
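
Definition 4.2 can be written as two predicates on the changes in the class match and mismatch counts. The function and parameter names are hypothetical: d_matches and d_mismatches denote the counts after the operation minus the counts before it.

```python
def responds_strictly_favorably(relevant, d_matches, d_mismatches):
    """Conditions (i) and (ii) of Definition 4.2."""
    if relevant:
        return ((d_mismatches < 0 and d_matches >= 0) or
                (d_matches > 0 and d_mismatches <= 0))
    return ((d_matches < 0 and d_mismatches >= 0) or
            (d_mismatches > 0 and d_matches <= 0))

def responds_favorably(relevant, d_matches, d_mismatches):
    """Conditions (i), (ii) or (iii) of Definition 4.2."""
    unchanged = d_matches == 0 and d_mismatches == 0
    return unchanged or responds_strictly_favorably(
        relevant, d_matches, d_mismatches)

# In Example 4.1 (D not relevant to R), Op(C_52, 0 -> d) takes (a', b')
# from (2, 1) to (1, 1): one class match fewer, mismatches unchanged.
print(responds_strictly_favorably(False, -1, 0))
```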

Definition 4.3. An acceptable (D,R) pair is one which 
satisfies assessment under the current term classification 
matrix. 

A stacked (D,R) pair is one which does not satisfy 
assessment under the current term classification matrix but may 
possibly satisfy assessment under some other classification 
matrix. 

Each of the strategies which we will present ensures 
that any operation, if applicable, helps the present (D,R) 
pair in satisfying assessment but also guarantees that the 
previously processed (D,R) pairs behave "reasonably well". 
Hopefully more and more (D,R) pairs are brought into assessment 
as more and more operations are performed. 

Strategy 1. All acceptable and stacked pairs which were 
processed previously respond favorably and the present (D,R) 
pair responds strictly favorably. 

Definition 4.4. A (D,R) pair is undisturbed by an operation 
if the operation does not change the pair from one which satisfies 
assessment into one which does not. 

Thus, a (D,R) pair which does not satisfy assessment is 
undisturbed by any operation. While strategy 1 does not allow 
any deterioration of any (D,R) pair which can satisfy assessment, 
strategy 2 will allow certain deterioration, provided it is 
not "too bad". 

Strategy 2. All acceptable (D,R) pairs which were previously 
processed are undisturbed and the current (D,R) pair responds 
strictly favorably. 
Definition 4.5. The number of favorable changes on a (D,R) 
pair is counted in the following way. 

i) if D is relevant to R, then each of the following 
   changes is counted as one favorable change 

   a) there is no change in the number of class 
      matches and the number of class mismatches 
      decreases by 1. 

   b) there is no change in the number of class 
      mismatches and the number of class matches 
      increases by 1. 

ii) if D is not relevant to R, then each of the following 
    changes is counted as a favorable change 

   a) there is no change in the number of class matches 
      and the number of class mismatches increases 
      by 1. 

   b) there is no change in the number of class 
      mismatches, and the number of class matches 
      decreases by 1. 

Thus, if D is relevant to R and a mismatch is converted to a 
match, then the number of favorable changes is 2. 

The number of unfavorable changes on a (D,R) pair is 
counted as above except that the words 'relevant' and 'not 
relevant' are interchanged in (i) and (ii). 

The number of favorable changes on a set of (D,R) pairs 
is given by the sum of all favorable changes on each (D,R) 
pair in the set. The number of unfavorable changes on a set 
of (D,R) pairs is calculated similarly. 

Strategy 3 . The number of favorable changes on all acceptable 
and stacked pairs is greater than the number of unfavorable 
changes on all acceptable and stacked pairs. 
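
The counting rule of Definition 4.5 and the test of strategy 3 can be sketched as follows. The sketch assumes that a pair is summarised by its numbers of class matches and class mismatches before and after an operation, and that a combined change is counted as the sum of its unit changes (so a mismatch converted to a match counts as 2). The function names are hypothetical.

```python
def count_changes(relevant, before, after):
    """Return (favorable, unfavorable) change counts for one (D,R) pair.

    before and after are (class_matches, class_mismatches) tuples.
    """
    d_match = after[0] - before[0]
    d_mismatch = after[1] - before[1]
    # Unit changes that raise the correlation: more matches, fewer mismatches.
    up = max(d_match, 0) + max(-d_mismatch, 0)
    # Unit changes that lower it: fewer matches, more mismatches.
    down = max(-d_match, 0) + max(d_mismatch, 0)
    return (up, down) if relevant else (down, up)


def strategy3_allows(pairs):
    """pairs: iterable of (relevant, before, after) taken over all
    acceptable and stacked pairs."""
    favorable = unfavorable = 0
    for relevant, before, after in pairs:
        f, u = count_changes(relevant, before, after)
        favorable += f
        unfavorable += u
    return favorable > unfavorable
```

For a relevant pair going from (1 match, 2 mismatches) to (2 matches, 1 mismatch), `count_changes` reports 2 favorable changes, in agreement with the remark following Definition 4.5.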

While strategy 2 does not allow any (D,R) pair which
satisfies assessment to deteriorate "too badly", strategy 3
allows deterioration of any (D,R) pair to any degree provided
that the set of all (D,R) pairs which can possibly satisfy
assessment "improves" on the whole. Generally, this will
improve system performance as measured by recall and precision.
In effect, when D is judged to be relevant (irrelevant) to R
and all documents which rank higher (lower) than D with respect
to R are relevant (irrelevant) to R, increasing (decreasing)
the rank of D any further will not improve recall and
precision. Thus a better strategy than strategy 3 would be to
count a change on a (D,R) pair as favorable only if there is at
least one irrelevant (relevant) document whose rank is higher
(lower) than that of D and it is possible for D to pass such a
document in ranking. However, any method attempting to
implement that would involve sorting the correlation
coefficients of documents with respect to each request, and a
lot of computing time is required. To avoid it, we use the
following definition in specifying strategy 4.



Definition 4.6. A favorable change in a restricted domain
on a (D,R) pair is a favorable change which satisfies one
of the following conditions.

i) if D is relevant to R, f(D,R) < T2 before the
operation is applied,
ii) if D is not relevant to R, f(D,R) > T1 before the
operation is applied.

T2 and T1 are constants (T2 > T1) specifying the restricted
domain we are interested in. Any document whose correlation
coefficient with R is greater (less) than T2 (T1) is assumed
to be relevant (irrelevant) to R. Thus, increasing (decreasing)
f(D,R) beyond T2 (T1) would not improve recall and precision.

Strategy 4 The number of favorable changes in restricted 
domain on all acceptable and stacked pairs is greater than 
the number of unfavorable changes on all acceptable and stacked 
pairs. 
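
The restriction of Definition 4.6 amounts to a filter on the favorable changes already counted for a pair. In this minimal sketch the two constants are written t1 and t2 with t2 > t1, and the function name and argument layout are assumptions made for illustration.

```python
def restricted_favorable(relevant, f_before, favorable, t1, t2):
    """Favorable changes that count in the restricted domain (t1, t2).

    f_before is f(D,R) before the operation is applied and favorable is
    the number of favorable changes counted without the restriction.
    """
    if t2 <= t1:
        raise ValueError("the restricted domain requires t2 > t1")
    if relevant:
        # Raising f(D,R) beyond t2 is not expected to help any further.
        return favorable if f_before < t2 else 0
    # Lowering f(D,R) below t1 is not expected to help any further.
    return favorable if f_before > t1 else 0
```

Strategy 4 then compares the sum of these restricted counts with the unrestricted number of unfavorable changes.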

Diagram 4.3 illustrates how all the (D,R) pairs are
processed in the first cycle. At the end of the first cycle,
there are 3 sets of (D,R) pairs, a set containing acceptable
(D,R) pairs, one containing stacked pairs and the last one
containing discarded pairs. In the second and subsequent
cycles, attempts are made to change stacked (D,R) pairs into
acceptable ones by going through the same flowchart but
processing the set of stacked pairs only. However, when a
strategy is applied, all pairs except those discarded are
taken into consideration.

A. Is there any (D,R) pair which has not been processed?
   No:  Cycle 1 is completed.
   Yes: Read in the (D,R) pair and go to B.

B. Does it satisfy assessment?
   Yes: Accept the pair and save it in a stack marked
        acceptable. Go to A.
   No:  Go to C.

C. Is it possible for the pair to satisfy assessment?
   No:  This pair is discarded. Go to A.
   Yes: Go to D.

D. (*) Is there an (any other) operation which will help the
   pair to satisfy assessment?
   No:  This pair is saved in a stack marked stacked. Go to A.
   Yes: Go to E.

E. Does the operation satisfy the condition of the strategy
   being used?
   No:  Go to D.
   Yes: Go to F.

F. Does the pair satisfy assessment?
   No:  Go to D.
   Yes: Accept the pair and save it in a stack marked
        acceptable. Go to A.

Diagram 4.3. Main program for cycle 1.

The stopping criteria for the strategies are as follows. For
strategies 1, 3 and 4, the algorithm halts iff no operation is
made between the last cycle and the present one or there is no
stacked pair. For strategy 2, the algorithm halts iff there is
no increase in the number of acceptable pairs between the last
cycle and the present one or there is no stacked pair.
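
The control flow of the cycles, with the stopping rule stated for strategies 1, 3 and 4, can be sketched as below. Everything specific is an assumption made for illustration: the five callbacks (`satisfies`, `can_satisfy`, `operations`, `allowed`, `apply_op`) are hypothetical hooks standing for the tests in the flowchart, and `allowed` is expected to look at all non-discarded pairs itself.

```python
def run_cycles(pairs, satisfies, can_satisfy, operations, allowed, apply_op):
    """Process (D,R) pairs cycle by cycle.

    Halts when a cycle applies no operation or leaves no stacked pair.
    Returns (acceptable, stacked, discarded, number_of_cycles).
    """
    acceptable, discarded = [], []
    todo = list(pairs)
    cycles = 0
    while True:
        cycles += 1
        applied = 0
        stacked = []
        for pair in todo:
            if satisfies(pair):
                acceptable.append(pair)
                continue
            if not can_satisfy(pair):
                discarded.append(pair)
                continue
            for op in operations(pair):
                if not allowed(op, pair):
                    continue  # the strategy rejects this operation
                apply_op(op, pair)
                applied += 1
                if satisfies(pair):
                    acceptable.append(pair)
                    break
            else:
                # No operation made the pair satisfy assessment.
                stacked.append(pair)
        if not stacked or applied == 0:
            return acceptable, stacked, discarded, cycles
        todo = stacked  # later cycles process the stacked pairs only
```

The sketch does not re-check acceptable pairs that a later operation may have disturbed; a fuller version would have to do so for strategies 2, 3 and 4.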

We will show that all the strategies converge. 

Proposition 4.3. Strategy 1 converges in a finite number of
steps.

Proof: Let f = the number of favorable changes in all acceptable
and stacked pairs. f is bounded above by the number of (D,R)
pairs times the maximum number of favorable changes a (D,R)
pair can possibly have. f is monotonically increasing in the
number of operations made because each operation which is
applied must increase f by at least 1. A monotonically
increasing function which takes on integer values and is
bounded above must converge in a finite number of steps.

Proposition 4.4. Strategy 2 stops in a finite number of
steps.

Proof: Let f = the number of acceptable (D,R) pairs. Since
the number of acceptable (D,R) pairs is less than or equal to
the number of (D,R) pairs, f is bounded above by the number of
(D,R) pairs. By the definition of strategy 2, f either
increases or remains the same from the i-th cycle to the
(i+1)-th cycle. In the latter case, strategy 2 stops. In the
former case, strategy 2 stops by an argument similar to that
of Proposition 4.3.
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Proposition 4.5. Strategy 3 (4) stops in a finite number of 
steps. 

Proof : Let f = the number of favorable changes (in restricted 
domain) on all acceptable and stacked (D,R) pairs - the number 
of unfavorable changes on the same pairs. Then apply an 
argument similar to that of Proposition 4.3. 

We are now in a position to modify the algorithm whose
flowchart is described by diagram 4.3 if the matching function
f is non-decreasing in the number of class matches and
non-increasing in the number of class mismatches. If f were a
strictly monotonic function, then an operation which will help
a (D,R) pair to satisfy assessment would be one which increases
the number of class matches or decreases the number of class
mismatches in the case D is relevant to R, and which either
decreases the number of class matches or increases the number
of class mismatches in the case D is not relevant to R. If f is
only non-decreasing in the number of class matches and
non-increasing in the number of class mismatches, then in order
to decide whether an operation really helps a (D,R) pair to
satisfy assessment, we not only have to check the above
conditions but also compute the value of f and see whether any
change has been made. This is the only change we have to make
in the algorithm (see the box marked with * in diagram 4.3).
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5. Experimental Results 

In this section, we shall compare the performance of the
four strategies with the performance of a matching function
which does not make use of the concept of class matches and
mismatches. The function chosen is the cosine function used
in the SMART Project [6]. In Diagram 5.2, this function is
referred to as f1. The matching function which uses the
concept of class matches and mismatches is

f2(D,R) = cos(D,R) + ((k1 a' - b') / (k1 (a' + b'))) k2,

where a' and b' are the number of class matches and
mismatches, respectively, and k1 and k2 are constants. The
comparison of the two matching functions will be carried out
in two ways: the percentage improvement of the new function
using a particular strategy over the cosine function in terms
of the number of (D,R) pairs satisfying assessment and in terms
of precision and recall.

In Diagram 5.2, the results are shown with k1 = 4 and
k2 = 0.1. In Diagram 5.1, the documents and the request pairs,
the assessment of the pairs and the threshold are given.

The percentage improvement of a particular strategy over
f1 is given by the formula {(no. of (D,R) pairs satisfying
assessment using that strategy) - (no. of (D,R) pairs
satisfying assessment using f1)} / {no. of (D,R) pairs
satisfying assessment using f1} x 100%. The number of (D,R)
pairs satisfying assessment by f1 and by the other strategies
is shown in Diagram 5.4.
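
As a worked instance of this formula, the short sketch below applies it to the counts reported for the example collection (16 pairs for the cosine function and 17, 22, 21 and 20 pairs for strategies 1 to 4). The function and variable names are chosen here for illustration.

```python
def percent_improvement(n_strategy, n_baseline):
    """Percentage improvement in the number of (D,R) pairs
    satisfying assessment over the baseline function."""
    return (n_strategy - n_baseline) / n_baseline * 100.0


baseline = 16                            # pairs satisfied by the cosine function
counts = {1: 17, 2: 22, 3: 21, 4: 20}    # pairs satisfied by each strategy
improvements = {s: percent_improvement(n, baseline) for s, n in counts.items()}
```

The resulting values are 6.25, 37.5, 31.25 and 25 percent for strategies 1 to 4.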
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Example: 



R1  = (0,0,1,0,1,0,0,1)
R2  = (1,0,1,1,0,0,1,1)
R3  = (0,1,1,0,0,1,1,0)

D1  = (0,1,1,0,1,0,0,1)
D2  = (0,0,0,0,1,1,1,1)
D3  = (0,1,0,1,0,1,0,1)
D4  = (1,0,1,0,1,0,1,0)
D5  = (1,1,1,1,0,0,0,0)
D6  = (0,0,1,1,0,0,1,1)
D7  = (1,1,0,0,1,1,0,0)
D8  = (1,1,1,0,0,1,0,1)
D9  = (0,1,0,0,1,0,1,1)
D10 = (0,1,0,1,1,0,1,0)

The number of terms = 8, the number of classes = 5.
Threshold T = 0.46.

Assessment matrix Z (rows R1 to R3, columns D1 to D10) =

    1 0 0 1 0 0 0 0 1 0
    0 1 0 1 1 0 1 0 0 1
    0 0 1 0 0 1 1 0 1 0

Diagram 5.1.
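
The example data can be checked against the cosine baseline with a short script. The sketch assumes that the vectors are labelled R1 to R3 and D1 to D10 in the order listed, that the rows of Z correspond to requests and the columns to documents, and that a pair satisfies assessment when a relevant document correlates above T and an irrelevant one does not.

```python
import math

REQUESTS = [
    (0, 0, 1, 0, 1, 0, 0, 1),
    (1, 0, 1, 1, 0, 0, 1, 1),
    (0, 1, 1, 0, 0, 1, 1, 0),
]
DOCUMENTS = [
    (0, 1, 1, 0, 1, 0, 0, 1),
    (0, 0, 0, 0, 1, 1, 1, 1),
    (0, 1, 0, 1, 0, 1, 0, 1),
    (1, 0, 1, 0, 1, 0, 1, 0),
    (1, 1, 1, 1, 0, 0, 0, 0),
    (0, 0, 1, 1, 0, 0, 1, 1),
    (1, 1, 0, 0, 1, 1, 0, 0),
    (1, 1, 1, 0, 0, 1, 0, 1),
    (0, 1, 0, 0, 1, 0, 1, 1),
    (0, 1, 0, 1, 1, 0, 1, 0),
]
Z = [
    (1, 0, 0, 1, 0, 0, 0, 0, 1, 0),
    (0, 1, 0, 1, 1, 0, 1, 0, 0, 1),
    (0, 0, 1, 0, 0, 1, 1, 0, 1, 0),
]
T = 0.46


def cosine(d, r):
    """Cosine correlation of two binary term vectors."""
    dot = sum(x * y for x, y in zip(d, r))
    # For binary vectors the squared length equals the number of terms.
    return dot / math.sqrt(sum(d) * sum(r))


def satisfies(d, r, relevant, threshold=T):
    c = cosine(d, r)
    return c > threshold if relevant else c <= threshold


satisfied = sum(
    satisfies(d, r, bool(Z[j][i]))
    for j, r in enumerate(REQUESTS)
    for i, d in enumerate(DOCUMENTS)
)
```

Under these assumptions `satisfied` comes out as 16 of the 30 pairs, the figure reported for the cosine function.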
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Number of (D,R) pairs given = 30 

Number of (D,R) pairs discarded using f2 is 6,
i.e., there are 6 (D,R) pairs which can never
satisfy assessment using f2 with the parameters
k1 = 4 and k2 = 0.1.

Number of (D,R) pairs satisfying assessment by

    f1    f2 with      f2 with      f2 with      f2 with
          strategy 1   strategy 2   strategy 3   strategy 4
    16    17           22           21           20

Initial term classification matrix C (rows are classes,
columns are terms)

    0 d 0 d 0 d 1 1
    1 0 d d 0 d 0 d
    d d 0 d 1 1 1 d
    0 1 d 0 d 1 d d
    d 0 1 1 d 0 0 0

f1 = cosine function

f2 = cos(D,R) + ((k1 a' - b') / (k1 (a' + b'))) k2

Diagram 5.2.

Diagram 5.3. Precision (vertical axis) against recall
(horizontal axis) for the cosine function (curve 0) and
strategies 1 to 4 (curves 1 to 4).

Percentage improvement of strategy i over the cosine
function in terms of the number of (D,R) pairs
satisfying assessment

    Strategy 1   Strategy 2   Strategy 3   Strategy 4
    6.25         37.5         31.25        25

Diagram 5.4.

Percentage improvement of strategy i over the cosine
function in terms of Recall and Precision

    Strategy 1   Strategy 2   Strategy 3   Strategy 4
    -4.3         6.6          13.1         13.1

Diagram 5.5.
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The percentage improvement of the strategies over f is given 
by Diagram 5.5. The precision and recall curves for the
strategies and the cosine function are plotted in Diagram 5.3.

There are a number of parameters that we can vary in
performing the above experiments. The first variation involves
changing the initial term classification matrix. The results
are shown in Diagrams 5.6 and 5.7. The second parameter
varied is the order in which the (D,R) pairs are processed.
Diagrams 5.8 and 5.9 show the results when all the (D,R)
pairs are processed in reverse order. Next, we vary the
constants k1 and k2. From the above (D,R) pairs, we find
the highest value which cosine(D,R) can have among all (D,R)
pairs is .6708 when both (D-R) and (R-D) are non-empty. If
the highest value which f2(D,R) is allowed to reach is 1,
then we should choose k2 = 1 - 0.6708 = .3292. Since the
number of classes equals 5, using k1 = 4 would imply that
if there is a match, then f2(D,R) ≥ f1(D,R). Thus, we should
lower k1 to 2 if we want class mismatches to have some
importance as compared to class matches. The results of
using k2 = .3292 and k1 = 2 are shown in Diagrams 5.10
and 5.11. The computing time for the four strategies in
the above experiments is shown in Diagram 5.12.
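
The argument about the two constants can be checked by enumeration. The sketch assumes that each of the 5 classes contributes at most one match or one mismatch, so that the numbers of matches and mismatches sum to at most 5, and that the sign of the class term of f2 is the sign of k1 a' - b'. The function name is hypothetical.

```python
def negative_cases(k1, n_classes=5):
    """List (a, b), with a >= 1 class matches and b class mismatches,
    for which k1 * a - b is negative."""
    return [
        (a, b)
        for a in range(1, n_classes + 1)
        for b in range(0, n_classes - a + 1)
        if k1 * a - b < 0
    ]


# Largest admissible k2 if f2 is not to exceed 1 on the example pairs.
k2 = round(1 - 0.6708, 4)
```

With k1 = 4 no case is negative, so a single match always outweighs the mismatches; with k1 = 2 one match can be outweighed by three or four mismatches.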

Percentage improvement of strategy i over the cosine
function in terms of the number of (D,R) pairs
satisfying assessment

    Strategy 1   Strategy 2   Strategy 3   Strategy 4
    25           31.25        31.25        31.25

A different initial term classification matrix is used

Diagram 5.6.

Percentage improvement of strategy i over the cosine
function in terms of Recall and Precision

    Strategy 1   Strategy 2   Strategy 3   Strategy 4
    4.3          10.3         13.3         11.7

A different initial term classification matrix is used

Diagram 5.7.

Percentage improvement of strategy i over the cosine
function in terms of the number of (D,R) pairs
satisfying assessment

    Strategy 1   Strategy 2   Strategy 3   Strategy 4
    6.25         25           12.50        31.25

The (D,R) pairs are processed in reverse order

Diagram 5.8.

Percentage improvement of strategy i over the cosine
function in terms of Recall and Precision

    Strategy 1   Strategy 2   Strategy 3   Strategy 4
    6.6          -0.9         16.8         16.8

The (D,R) pairs are processed in reverse order

Diagram 5.9.

Percentage improvement of strategy i over the cosine
function in terms of the number of (D,R) pairs
satisfying assessment

    Strategy 1   Strategy 2   Strategy 3   Strategy 4
    43.75        62.5         50           56.25

Constants k1 = 2, k2 = .3292

Diagram 5.10.

Percentage improvement of strategy i over the cosine
function in terms of Recall and Precision

    Strategy 1   Strategy 2   Strategy 3   Strategy 4
    12.5         22.0         22.6         22.5

Constants k1 = 2, k2 = .3292

Diagram 5.11.

Computing time of the different
strategies in the above experiments

                            Strategy 1   Strategy 2   Strategy 3   Strategy 4
    Original matrix         23.70 sec.   32.23 sec.   48.75 sec.   39.53 sec.
    Altered matrix          13.93 sec.   12.83 sec.   22.24 sec.   27.07 sec.
    (D,R) pairs processed   17.35 sec.   14.89 sec.   39.09 sec.   43.42 sec.
      in reverse order
    Changing k1, k2 to      45 sec.      30.17 sec.   77.96 sec.   77.81 sec.
      k1 = 2, k2 = .3292

Diagram 5.12.

By looking at the experimental results, strategy 2
is far superior to strategy 1 in terms of the number of (D,R)
pairs satisfying assessment. This is not at all surprising
because any operation that is allowed in strategy 1 is also
allowed in strategy 2 under the same term classification
matrix, and strategy 1 is restricting too many operations
to be applicable. In the experiments that we ran, strategy
2 is even better than strategy 3 in the number of (D,R) pairs
satisfying assessment, but strategy 3 is better than strategy
2 in terms of recall and precision. Strategy 3 guarantees
that any operation that is applied improves the global
condition of all the (D,R) pairs while strategy 2 is more
interested in the satisfiability of the current (D,R) pair.
Strategy 2 allows an operation which deteriorates some of
the (D,R) pairs that have been processed, provided that the
deterioration in each pair is not "too much". The operation
may hurt the global behavior of the (D,R) pairs and thus on
the whole, we expect strategy 3 to be superior in recall and
precision. Strategy 4's performance should be better than
that of strategy 3 but it is not easy to fix the values of
the thresholds and to obtain good results.

When the parameter k2 is changed from 0.1 to 0.3292,
most of the (D,R) pairs which were discarded when k2 = 0.1
become available for processing. Thus, the performance of
all the strategies improves significantly. On the other
hand, a lot more operations have to be performed and the
computing time increases considerably. When the other
parameters are changed, i.e., changing the initial term
classification and the order in which the (D,R) pairs are
processed, strategy 2 still performs very well in the number
of (D,R) pairs satisfying assessment, and the performance of
strategies 3 and 4 remains good in terms of recall and
precision. When a different initial term classification matrix
is used, the performance of strategy 1 improves from 6.25% to
25% in terms of the number of (D,R) pairs satisfying
assessment.

In general, we believe that the strategies which are
designed for a particular purpose (strategy 2 maximizing the
number of (D,R) pairs satisfying assessment and strategies 3
and 4 improving recall and precision) are rather independent
of the order in which the (D,R) pairs are processed and also
of the initial term classification matrix. (A particular
initial term classification matrix may make one strategy
perform a lot better and another one a lot worse. But a
randomly generated one with mostly don't cares as entries
will give approximately the same result each time.) On the
other hand, the performance of strategy 1 is highly dependent
on the above parameters.

We shall try strategy 3 on a subset of the ADI collection
which has 82 documents and 35 requests. The first five
requests are chosen and the (D,R) pairs are chosen satisfying
one of the following conditions: (i) D is relevant to R or
(ii) D ranks within the top twenty documents in correlating
with R. The number of (D,R) pairs satisfying the above criterion
equals 103. The number of terms and classes used is 1187
and 20, respectively. The results are shown in Diagram 5.13.
The chosen threshold T = 0.25 with constants k1 = 3 and
k2 = 0.1. The initial term classification matrix is too
large to be given here. The total number of (D,R) pairs is
103; strategy 3 accepts 101 pairs and 2 pairs are discarded
(with no stacked pairs). The computing time spent is 44 minutes,
but the improvement obtained is unexpectedly good. It has a
110.9% improvement over the cosine function.

6. Computation Time of the Heuristic Methods

In this section, we shall give a rough estimation of the 
computing time of strategy 3. In the experiments shown in 
the last section, we saw that strategy 3 usually takes more 
time than the other strategies. Thus a rough bound for 
strategy 3 would serve as a bound for other strategies. 

Let the number of (D,R) pairs be n. Let the average 
number of terms in a document be p and that in a request be q. 
Suppose the number of classes is r. 

If D is relevant to R, then the number of favorable
changes in one class is at most 2, namely changing a mismatch
to a no match and then from a no match to a class match.
Similarly if D is not relevant to R, there are at most 2
favorable changes per class. Thus the total number of
favorable changes for a (D,R) pair is at most 2r, implying
that the total number of favorable changes for all (D,R) pairs
is bounded by 2rn. From the i-th cycle to the (i+1)-th cycle,
at least one operation is applied. Therefore, the number of
favorable changes increases by at least one. So, the number
of cycles is at most 2rn. Now in each cycle, all the stacked
pairs are processed and the number of stacked pairs is less
than or equal to n. For each pair, the number of operations
that may be favorable (but not necessarily applied) is at most
2(p+q)r, because there are at most 2 favorable changes for
each class and each favorable change may be caused by any one
of the (p+q) terms (the cardinality of (D-R) ∪ (R-D) is bounded
by (p+q)). For each operation which may be applicable, we
have to go through all acceptable and stacked pairs, which
is not more than n. If (D-R), (R-D), (D-R)^C and (R-D)^C
are stored, then checking whether a given operation is
favorable to a pair or not takes constant time. Thus the
total time taken is



    [2rn] x [n] x [2(p+q)r] x [n] = O(n^3 r^2 (p+q))

where the four factors are, in order, the number of cycles,
the number of (D,R) pairs processed per cycle, the number of
operations considered for each (D,R) pair, and the number of
(D,R) pairs to be checked to decide whether an operation is
applicable or not.

This is really a large number when the number of (D,R)
pairs is large or the number of classes is large, but this
number does not represent the actual running time of strategy
3. In the first cycle, the computing time is O(n^2 r (p+q)).
For each cycle after the first cycle, the number of stacked
pairs is extremely small and for practical purposes, can be
assumed to be bounded by a constant. Thus the amount of
computer time spent per cycle after the first cycle is
O((p+q)rn). In all the experiments performed, the number of
cycles was at most 4 and in many cases, the number went down
to 2. A more realistic situation is to assume that the number
of cycles is a constant. Under that assumption, the computing
time spent is O(n^2 r (p+q)) + O((p+q)rn) = O(n^2 (p+q) r). The
computer times used in the different experiments confirm the
above calculation.
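
The two estimates can be compared numerically. The sketch below simply multiplies out the factors named in the text; the function names and the default values for the number of cycles and of stacked pairs per later cycle are assumptions chosen for illustration.

```python
def worst_case_steps(n, r, p, q):
    """Product of the four worst-case factors."""
    cycles = 2 * r * n
    pairs_per_cycle = n
    operations_per_pair = 2 * (p + q) * r
    checks_per_operation = n
    return cycles * pairs_per_cycle * operations_per_pair * checks_per_operation


def practical_steps(n, r, p, q, cycles=4, stacked=1):
    """First cycle over all n pairs, then a constant number of
    stacked pairs in each of the remaining cycles."""
    first = n * 2 * (p + q) * r * n
    later = (cycles - 1) * stacked * 2 * (p + q) * r * n
    return first + later
```

For example, with n = 30 pairs, r = 5 classes and p = q = 4 (the last two values are illustrative only), the worst-case product is 21,600,000 steps while the practical estimate is 79,200.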

7 . Conclusion 

The formal construction of term classes is shown to be a
difficult process computationally. The approximation to the
construction by the heuristic methods, especially the third
strategy, is verified by experimental results to be a
sufficiently close one. However, the computer time required
by the heuristic methods is still too large for any real
document collection. Future research should be directed at
getting better heuristic algorithms. Our next set of
experiments will be aimed at finding out whether the construction
obtained by a given set of queries and user's assessment agrees
with that obtained by a different set.
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APPENDIX 

[The results of this appendix were obtained jointly
with S.K. Sahni.]

Theorem 1. For every formula in conjunctive normal form
having exactly three literals per clause, there exists a
set of documents D's, a set of requests R's, a matching
function f, a threshold T, an assessment matrix Z and a
partial solution for the satisfiable assessment problem
such that the formula is satisfiable <=> all (D,R) pairs
are satisfied.

Proof: Let P be a formula in conjunctive normal form with
n variables x_1, x_2, ..., x_n and m clauses c_1, c_2, ..., c_m.
Thus P is of the form of the conjunction of the clauses c_i,
i = 1, ..., m. Further, each clause c_i has exactly three
literals and is of the form c_i1 v c_i2 v c_i3, where c_ij is
either a variable or its complement. We shall show how to
obtain a set of (D,R) pairs, a matching function f, a
threshold T, an assessment matrix Z and a partial solution
such that all (D,R) pairs are satisfied iff the formula P is
satisfiable.

Corresponding to each clause c_i there will be one R_i
and one D_i. We shall have 3n + m distinct terms:

{g_i}, i = 1, ..., m; {t_j}, {z_j} and {z̄_j}, j = 1, ..., n.

For each clause c_i construct R_i and D_i as follows. R_i
and D_i have four terms each. The terms of D_i are given by

i) g_i is in D_i.

ii) z̄_j is in D_i iff the literal x̄_j occurs in c_i.

iii) z_j is in D_i iff the literal x_j occurs in c_i.

The terms of R_i are given by

i) g_i is in R_i.

ii) t_j is in R_i iff the variable x_j is in clause c_i
(variable x_j occurs in c_i iff either the literal x_j
or the literal x̄_j occurs in c_i).

Each class vector has (n + 1) components and let the
partial solution be specified as follows.

i) for the terms g_1, g_2, ..., g_m, the class vector for
each such term is (d_1, ..., d_n, 1) where each d_i
is d.

ii) for the terms t_1, t_2, ..., t_n, the class vector
for a term t_i is
(d_1, d_2, ..., d_{i-1}, y_i, d_{i+1}, ..., d_{n+1}) for 1 ≤ i ≤ n
where y_i is to be assigned 0 or 1 and each d_j
is d.

iii) for the terms z_1, z_2, ..., z_n, the class vector for
z_i is (d_1, ..., d_{i-1}, 1, d_{i+1}, ..., d_{n+1}).

iv) for the terms z̄_1, z̄_2, ..., z̄_n, the class vector for
z̄_i is (d_1, ..., d_{i-1}, 0, d_{i+1}, ..., d_{n+1}).

The y_i as appeared in (ii) above corresponds to the
variable x_i of formula P, i.e., y_i = x_i, 1 ≤ i ≤ n.

The matching function f is given by f(D_i,R_j) = 4a' - b'
where a' and b' are the class matches and mismatches between
the class vectors of (D_i - R_j) and (R_j - D_i), respectively.
The threshold T is 0 and the entries of the assessment matrix
are all 1's.

We will first show that all (D_i,R_j) pairs with i ≠ j
satisfy assessment. For i ≠ j, g_i ∈ (D_i - R_j) and
g_j ∈ (R_j - D_i). Thus the class vector of
(D_i - R_j) = (X,X,X,...,X,1) where X can be 0, 1 or d but the
number of non-d entries among the X's is 3. This is because
each clause has exactly 3 literals and each term which
corresponds to a literal has exactly 1 non-d entry by the above
construction. Similarly the class vector of
(R_j - D_i) = (X,X,X,...,X,1), with the same condition on X.
There is at least one class match between (R_j - D_i)^C and
(D_i - R_j)^C because there is already a class match in the
(n+1)st position. Since the number of non-d entries in
each class vector is 4 (including the 1 in the (n+1)st
position), the number of class mismatches is ≤ 3. By the
matching function f given above, f(D_i,R_j) ≥ 1 > 0 and thus
the pair satisfies assessment.

For i = j, we are considering the document request pair
(D_i,R_i). By the construction above,
(D_i - R_i)^C = (δ_1, δ_2, ..., δ_{n+1}) where

    δ_p = 1 if literal x_p ∈ clause c_i
          0 if literal x̄_p ∈ clause c_i
          d otherwise

and there are exactly 3 non-d entries among the δ's.
Similarly, (R_i - D_i)^C = (w_1, w_2, ..., w_{n+1}) where

    w_p = y_p if x_p or x̄_p ∈ clause c_i
          d otherwise

and there are exactly 3 non-d entries among the w's.

The positions in which the y's occur in (R_i - D_i)^C
are exactly where the non-d entries in (D_i - R_i)^C occur.
Thus if c_i = (x_1 v x_2 v x̄_3), then

(D_i - R_i)^C = (1,1,0,d,...,d)

and

(R_i - D_i)^C = (y_1,y_2,y_3,d,...,d).

The above clause is satisfied iff x_1 = 1 or x_2 = 1 or x_3 = 0

<=> y_1 = 1 or y_2 = 1 or y_3 = 0

<=> there is at least a class match between (D_i - R_i)^C
and (R_i - D_i)^C

<=> f(D_i,R_i) ≥ 1 > 0

<=> the (D,R) pair satisfies assessment.

Since this is true for all clauses, i.e., for all pairs
(D_i,R_i), i = 1, ..., m, and all pairs (D_i,R_j) with i ≠ j
satisfy assessment, the formula is satisfiable iff all (D,R)
pairs satisfy assessment.

Note that the above construction can be done in deterministic 
polynomial time. 
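
The construction can be exercised on small formulas by brute force. The sketch below is an illustration under stated assumptions, not the authors' program: clauses are tuples of three signed integers (+j for the variable x_j, -j for its complement) over distinct variables, the term names "g", "t", "z" and "zbar" are labels chosen here, the class vector of a set of terms is taken position by position from its members, and all function names are hypothetical.

```python
from itertools import product

D = "d"  # don't-care entry of a class vector


def term_vector(term, n, y):
    """Class vector (n + 1 components) of a single term under assignment y."""
    kind, idx = term
    vec = [D] * (n + 1)
    if kind == "g":
        vec[n] = 1
    elif kind == "t":
        vec[idx - 1] = y[idx]
    elif kind == "z":
        vec[idx - 1] = 1
    else:  # "zbar"
        vec[idx - 1] = 0
    return vec


def set_vector(terms, n, y):
    vec = [D] * (n + 1)
    for term in terms:
        for pos, value in enumerate(term_vector(term, n, y)):
            if value != D:
                vec[pos] = value
    return vec


def f(doc, req, n, y):
    """Matching function 4a' - b' on the class vectors of D-R and R-D."""
    dv = set_vector(doc - req, n, y)
    rv = set_vector(req - doc, n, y)
    both = [(u, v) for u, v in zip(dv, rv) if u != D and v != D]
    matches = sum(1 for u, v in both if u == v)
    return 4 * matches - (len(both) - matches)


def build(clauses):
    docs, reqs = [], []
    for i, clause in enumerate(clauses, start=1):
        docs.append({("g", i)} | {("z" if l > 0 else "zbar", abs(l)) for l in clause})
        reqs.append({("g", i)} | {("t", abs(l)) for l in clause})
    return docs, reqs


def all_pairs_satisfied(clauses, n, y):
    docs, reqs = build(clauses)
    # Threshold 0 and an all-ones assessment matrix: every pair needs f > 0.
    return all(f(d, r, n, y) > 0 for d in docs for r in reqs)


def formula_true(clauses, y):
    return all(any((y[abs(l)] == 1) == (l > 0) for l in c) for c in clauses)


def check(clauses, n):
    """True if, for every assignment, the formula holds exactly when
    all (D,R) pairs satisfy assessment."""
    for bits in product((0, 1), repeat=n):
        y = dict(enumerate(bits, start=1))
        if all_pairs_satisfied(clauses, n, y) != formula_true(clauses, y):
            return False
    return True
```

Calling `check([(1, 2, -3), (-1, 2, 3), (1, -2, 3)], 3)` tries all eight assignments of three variables and compares the two sides of the equivalence for each.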

Lemma 2. If the satisfiable assessment problem with a partial
solution can be done in deterministic polynomial time, then we
can decide in deterministic polynomial time if all the (D,R)
pairs can be satisfied.

Proof: Let the number of (D,R) pairs be k. The maximum
number of (D,R) pairs satisfying assessment = k iff all (D,R)
pairs can be satisfied. Therefore, deciding whether all (D,R)
pairs can be satisfied involves comparing the maximum number
of (D,R) pairs satisfying assessment with k.

Theorem 3. If there is a deterministic polynomial algorithm
for the exactly 3 literal satisfiability problem, then there
is one for the satisfiable assessment problem with a partial
solution.

Proof: If there is a deterministic polynomial algorithm for
the exactly 3 literal satisfiability problem, then any
non-deterministic Turing machine running in polynomial time can
be simulated by a deterministic Turing machine in polynomial
time (see [11]).

We shall construct a non-deterministic Turing machine
which checks if the number of (D,R) pairs satisfying
assessment is greater than a given number.

Let the input of the non-deterministic Turing machine
be constructed as follows,

#g# all the (D,R) pairs, their assessment and a partial solution

where g is a number.

The non-deterministic Turing machine guesses which terms
are in which classes. Then it checks if the number of (D,R)
pairs satisfying assessment is greater than g. If it is,
then it accepts; otherwise, it rejects. This non-deterministic
Turing machine runs in polynomial time because by definition
of non-determinism the guess is always correct and the time
required to check the number of (D,R) pairs satisfying
assessment is polynomial.

By hypothesis, there is an equivalent deterministic
Turing machine running in deterministic polynomial time.

Let us give this deterministic Turing machine the
following input,

#k-i# all the (D,R) pairs, their assessment and a partial solution

where k is the number of (D,R) pairs, and i = 1.

The deterministic Turing machine must halt in polynomial
time and either accepts or rejects the input. If it accepts,
then by the construction of the Turing machine, all the (D,R)
pairs are satisfied. If it does not accept, then i is
incremented to 2. The above process is repeated until the
least value of i is found such that the deterministic Turing
machine accepts the input. Note that when i = k + 1, the
Turing machine must accept and thus the process takes no
more than polynomial time. When the least value of i is
found such that the Turing machine accepts, the maximum
number of (D,R) pairs satisfying assessment is (k - i + 1).
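
The search over i can be sketched with the decision procedure treated as a black box. Here `accepts(g)` is a hypothetical callable standing for the deterministic machine run on the input with the number g, answering whether more than g pairs can satisfy assessment.

```python
def max_satisfiable(k, accepts):
    """Maximum number of (D,R) pairs that can satisfy assessment,
    found with at most k + 1 calls to the decision procedure."""
    for i in range(1, k + 2):
        if accepts(k - i):
            return k - i + 1
    raise AssertionError("the machine must accept when i = k + 1")
```

With k = 10 pairs and a true maximum of 7, the first accepted input has i = 4 and the function returns 10 - 4 + 1 = 7.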

From the above theorems, we obtain the following results. 

Theorem 3.1 The satisfiable assessment problem with a 
partial solution is polynomial complete. 
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